Unsupervised Discrimination of Person Names in Web Contexts
Authors
Abstract
Ambiguous person names are a problem in many forms of written text, including text found on the Web. In this paper we explore the use of unsupervised clustering techniques to discriminate among entities named in Web pages. We examine three main issues via an extensive experimental study: first, the effect of using a held-out set of training data for feature selection versus using the data in which the ambiguous names occur; second, the impact of using different measures of association for identifying lexical features; and third, the success of different cluster stopping measures that automatically determine the number of clusters in the data.
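As a rough illustration of what a "measure of association" for lexical features can look like, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). The abstract does not say which measures the paper compares, so the choice of PMI, the function name `pmi_bigrams`, the `min_count` parameter, and the toy token list are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=1):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration only; PMI is one of several
    possible association measures and is an assumption here.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # drop pairs seen fewer than min_count times
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy contexts around an ambiguous name (invented data).
tokens = ("george miller psychologist george miller director "
          "the film the book").split()
scores = pmi_bigrams(tokens, min_count=2)
```

With `min_count=2`, only pairs that occur at least twice survive as candidate features; in a real setting the counts would come either from a held-out corpus or from the contexts being clustered, which is the first contrast the abstract describes.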
Similar resources
Unsupervised Discrimination and Labeling of Ambiguous Names
This paper describes adaptations of unsupervised word sense discrimination techniques to the problem of name discrimination. These methods cluster the contexts containing an ambiguous name, such that each cluster refers to a unique underlying person or place. We also present new techniques to assign meaningful labels to the discovered clusters.
Name Discrimination and Email Clustering using Unsupervised Clustering and Labeling of Similar Contexts
In this paper, we apply an unsupervised word sense discrimination technique based on clustering similar contexts (Purandare and Pedersen, 2004) to the problems of name discrimination and email clustering. Names of people, places, and organizations are not always unique. This can create a problem when we refer to or seek out information about such entities. When this occurs in written text, we s...
Discovering Identities in Web Contexts with Unsupervised Clustering
We describe the application of unsupervised clustering methodologies to the problem of discriminating among ambiguous names found in short passages of text that appear on Web pages. We show how to tailor these methods to handle the very noisy data that we typically find on the Web. We experiment with several variations in feature selection, two methods that automatically determine the number of...
Improved Unsupervised Name Discrimination with Very Wide Bigrams and Automatic Cluster Stopping
We cast name discrimination as a problem in clustering short contexts. Each occurrence of an ambiguous name is treated independently, and represented using second-order context vectors. We calibrate our approach using a manually annotated collection of five ambiguous names from the Web, and then apply the learned parameter settings to three held-out sets of pseudo-name data that have been repor...
Extracting Key Phrases To Disambiguate Personal Name Queries In Web Search
Assume that you are looking for information about a particular person. A search engine returns many pages for that person’s name. Some of these pages may be on other people with the same name. One method to reduce the ambiguity in the query and filter out the irrelevant pages, is by adding a phrase that uniquely identifies the person we are interested in from his/her namesakes. We propose an un...
Publication date: 2007